Search CORE

58 research outputs found

Dutch parallel corpus: a balanced parallel corpus for Dutch-English and Dutch-French

Author: FJ Och
G Sutter De
G Vanderbauwhede
Isabelle Delaere
L Macken
L Macken
Lieve Macken
M Kay
M Simard
MP Marcus
P Keirsbilck Van
PF Brown
R Moore
W Daelemans
WA Gale
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

status: published

Lirias

Crossref

Springer - Publisher Connector

Ghent University Academic Bibliography

Syntactic discriminative language model rerankers for statistical machine translation

Author: B Roark
Christof Monz
D Chiang
F Rosenblatt
FJ Och
PF Brown
SF Chen
SI Gallant
Simon Carter
Y Freund
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

This article describes a method that successfully exploits syntactic features for n-best translation candidate reranking using perceptrons. We motivate the utility of syntax by demonstrating the superior performance of parsers over n-gram language models in differentiating between Statistical Machine Translation output and human translations. Our approach uses discriminative language modelling to rerank the n-best translations generated by a statistical machine translation system. The performance is evaluated for Arabic-to-English translation using NIST’s MT-Eval benchmarks. While deep features extracted from parse trees do not consistently help, we show how features extracted from a shallow Part-of-Speech annotation layer outperform a competitive baseline and a state-of-the-art comparative reranking approach, leading to significant BLEU improvements on three different test sets

Crossref

Springer - Publisher Connector

International Migration, Integration and Social Cohesion online publications

UvA-DARE

A Bayesian non-linear method for feature selection in machine translation quality estimation

Author: CE Rasmussen
FJ Och
J Quiñonero-Candela
K Shah
Kashif Shah
L Specia
Lucia Specia
N Meinshausen
PF Brown
R Sikes
Trevor Cohn
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/01/2015

We perform a systematic analysis of the effectiveness of features for the problem of predicting the quality of machine translation (MT) at the sentence level. Starting from a comprehensive feature set, we apply a technique based on Gaussian processes, a Bayesian non-linear learning method, to automatically identify features leading to accurate model performance. We consider application to several datasets across different language pairs and text domains, with translations produced by various MT systems and scored for quality according to different evaluation criteria. We show that selecting features with this technique leads to significantly better performance in most datasets, as compared to using the complete feature sets or a state-of-the-art feature selection approach. In addition, we identify a small set of features which seem to perform well across most datasets

Crossref

Publikationsserver der RWTH Aachen University

White Rose Research Online

University of Melbourne Institutional Repository

Predicting sentence translation quality using extrinsic and language independent features

Author: AJ Smola
Declan Groves
Ergun Biçici
FJ Och
I Guyon
I Guyon
Josef van Genabith
JS Albrecht
L Specia
L Wasserman
P Koehn
PF Brown
T Hastie
TM Cover
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/12/2013

We develop a top performing model for automatic, accurate, and language independent prediction of sentence-level statistical machine translation (SMT) quality with or without looking at the translation outputs. We derive various feature functions measuring the closeness of a given test sentence to the training data and the difficulty of translating the sentence. We describe mono feature functions that are based on statistics of only one side of the parallel training corpora and duo feature functions that incorporate statistics involving both source and target sides of the training data. Overall, we describe novel, language independent, and SMT system extrinsic features for predicting the SMT performance, which also rank high during feature ranking evaluations. We experiment with different learning settings, with or without looking at the translations, which help differentiate the contribution of different feature sets. We apply partial least squares and feature subset selection, both of which improve the results and we present ranking of the top features selected for each learning setting, providing an exhaustive analysis of the extrinsic features used. We show that by just looking at the test source sentences and not using the translation outputs at all, we can achieve better performance than a baseline system using SMT model dependent features that generated the translations. Furthermore, our prediction system is able to achieve the 2nd best performance overall according to the official results of the Quality Estimation Task (QET) challenge when also looking at the translation outputs. Our representation and features achieve the top performance in QET among the models using the SVR learning model

Crossref

Irish Universities

DCU Online Research Access Service

A Novel and Robust Approach for Pro-Drop Language Translation

Author: Andy Way
CN Li
CTJ Huang
FJ Och
Hang Li
Longyue Wang
M Nakamura
M Wang
N Xue
Qun Liu
R Hwa
R Quirk
Siyou Liu
Xiaojun Zhang
Zhaopeng Tu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/01/2017

A significant challenge for machine translation (MT) is the phenomenon of dropped pronouns (DPs), where certain classes of pronouns are frequently dropped in the source language but should be retained in the target language. In response to this common problem, we propose a semi-supervised approach with a universal framework to recall missing pronouns in translation. Firstly, we build training data for DP generation in which the DPs are automatically labelled according to the alignment information from a parallel corpus. Secondly, we build a deep learning-based DP generator for input sentences in decoding when no corresponding references exist. More specifically, the generation has two phases: (1) DP position detection, which is modeled as a sequential labelling task with recurrent neural networks; and (2) DP prediction, which employs a multilayer perceptron with rich features. Finally, we integrate the above outputs into our statistical MT (SMT) system to recall missing pronouns by both extracting rules from the DP-labelled training data and translating the DP-generated input sentences. To validate the robustness of our approach, we investigate our approach on both Chinese–English and Japanese–English corpora extracted from movie subtitles. Compared with an SMT baseline system, experimental results show that our approach achieves a significant improvement of +1.58 BLEU points in translation performance with 66% F-score for DP generation accuracy for Chinese–English, and nearly +1 BLEU point with 58% F-score for Japanese–English. We believe that this work could help both MT researchers and industries to boost the performance of MT systems between pro-drop and non-pro-drop languages

Crossref

Stirling Online Research Repository (RIOXX)

Stirling Online Research Repository

A massively parallel corpus: the Bible in 100 languages

Author: Christos Christodoulopoulos
CP Wei
FJ Och
M Marcus
M Potthast
Mark Steedman
P Koehn
P Resnik
T Kanungo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora

Crossref

Springer - Publisher Connector

PubMed Central

Edinburgh Research Explorer

Creating a medical dictionary using word alignment: The influence of sources and resources

Author: FJ Och
Hans Åhlfeldt
Håkan Petersson
ID Melamed
J Foo
L Ahrenberg
LR Dice
M Merkel
M Merkel
Magnus Merkel
Mikael Nyström
MT Pazienza
Nordic Medico-Statistical Committee
P Tapanainen
PF Brown
Socialstyrelsen
Socialstyrelsen
Socialstyrelsen
Socialstyrelsen
WA Gale
World Health Organization
World Health Organization
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract Background Automatic word alignment of parallel texts with the same content in different languages is among other things used to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automatic word alignment of the English and Swedish versions of the medical terminology systems ICD-10, ICF, NCSP, KSH97-P and parts of MeSH and how the terminology systems and type of resources influence the quality. Methods We automatically word aligned the terminology systems using static resources, like dictionaries, statistical resources, like statistically derived dictionaries, and training resources, which were generated from manual word alignment. We varied which part of the terminology systems that we used to generate the resources, which parts that we word aligned and which types of resources we used in the alignment process to explore the influence the different terminology systems and resources have on the recall and precision. After the analysis, we used the best configuration of the automatic word alignment for generation of candidate term pairs. We then manually verified the candidate term pairs and included the correct pairs in an English-Swedish dictionary. Results The results indicate that more resources and resource types give better results but the size of the parts used to generate the resources only partly affects the quality. The most generally useful resources were generated from ICD-10 and resources generated from MeSH were not as general as other resources. Systematic inter-language differences in the structure of the terminology system rubrics make the rubrics harder to align. Manually created training resources give nearly as good results as a union of static resources, statistical resources and training resources and noticeably better results than a union of static resources and statistical resources. 
The verified English-Swedish dictionary contains 24,000 term pairs in base forms. Conclusion More resources give better results in the automatic word alignment, but some resources only give small improvements. The most important type of resource is training and the most general resources were generated from ICD-10.

Publikationer från Linköpings universitet

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digitala Vetenskapliga Arkivet - Academic Archive On-line

A Cross-Lingual Similarity Measure for Detecting Biomedical Term Translations

Author: A Cichocki
C Ding
CD Manning
Danushka Bollegala
E Morin
FJ Och
Georgios Kontonatsios
GH Golub
H Wold
H Wold
K Frantzi
L Breiman
L van der Maaten
ME Tipping
N Okazaki
Neil R. Smalheiser
NT Duc
P Geladi
P Turney
PD Turney
PD Turney
R Rosipal
Sophia Ananiadou
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/06/2015

Bilingual dictionaries for technical terms such as biomedical terms are an important resource for machine translation systems as well as for humans who would like to understand a concept described in a foreign language. Often a biomedical term is first proposed in English and later it is manually translated to other languages. Despite the fact that there are large monolingual lexicons of biomedical terms, only a fraction of those term lexicons are translated to other languages. Manually compiling large-scale bilingual dictionaries for technical domains is a challenging task because it is difficult to find a sufficiently large number of bilingual experts. We propose a cross-lingual similarity measure for detecting most similar translation candidates for a biomedical term specified in one language (source) from another language (target). Specifically, a biomedical term in a language is represented using two types of features: (a) intrinsic features that consist of character n-grams extracted from the term under consideration, and (b) extrinsic features that consist of unigrams and bigrams extracted from the contextual windows surrounding the term under consideration. We propose a cross-lingual similarity measure using each of those feature types. First, to reduce the dimensionality of the feature space in each language, we propose prototype vector projection (PVP)—a non-negative lower-dimensional vector projection method. Second, we propose a method to learn a mapping between the feature spaces in the source and target language using partial least squares regression (PLSR). The proposed method requires only a small number of training instances to learn a cross-lingual similarity measure. The proposed PVP method outperforms popular dimensionality reduction methods such as the singular value decomposition (SVD) and non-negative matrix factorization (NMF) in a nearest neighbor prediction task. 
Moreover, our experimental results covering several language pairs such as English–French, English–Spanish, English–Greek, and English–Japanese show that the proposed method outperforms several other feature projection methods in biomedical term translation prediction tasks

University of Liverpool Repository

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

Edge Hill University Research Information Repository

PubMed Central

The University of Manchester - Institutional Repository

FigShare

Getting Past the Language Gap: Innovations in Machine Translation

Author: A Fraser
A Maletti
AB Phillips
C David
C España-Bonet
D Chiang
D Wu
F Jelinek
F Sanchez-Martinez
FJ Och
FJ Och
GS Matthew
H Somers
HM Caseli
Huang Liang Hao Zhang, Daniel Gildea, Kevin Knight
I Alegria
I Cicekli
J Bellegarda
J Bellegarda
J Hutchins
K Baker
K Baker
K Owczarzak
L Levin
M Ashburner
M Bisani
N Collier
N Habash
N Ueffing
OF Josef
OF Josef
P Cimiano
P Vossen
S Ravi
T Green
Tong Xiao
V Pekar
W Mischo
Wei Wang
WJ Hutchins
Y Wilks
Y Wilks
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT

Archivio Ricerca Ca'Foscari

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari